Snarls, pangenome deconstruction, and read mapping with vg giraffe

GET-A-PAN

Xian Chang & Jean Monlong

06/11/2025

Pangenome

as a birected graph

Pangenome

as a birected graph with haplotype paths

Snarls, intuitively a “bubble”

Snarls, formal definition

A snarl is a subgraph bounded by two node sides that are:

Snarls, formal definition

A snarl is a subgraph bounded by two node sides that are:

  1. Separable: splitting the boundary nodes separates the snarl from the graph

Snarls, formal definition

A snarl is a subgraph bounded by two node sides that are:

  1. Separable: splitting the boundary nodes separates the snarl from the graph
  2. Minimal: no nodes within the snarl are separable with either boundary node side

Snarls, formal definition

A snarl is a subgraph bounded by two node sides that are:

  1. Separable: splitting the boundary nodes separates the snarl from the graph
  2. Minimal: no nodes within the snarl are separable with either boundary node side

Chains

A run of consecutive snarls and nodes is called a chain.

Snarl decomposition

Snarls and chains can be nested inside of each other.

Snarls, chains, and nodes in a tree

The nested relationship of snarls and chains is described by the snarl tree.

Netgraphs

Netgraphs are a representation of snarls with their child chains collapsed into a single node

Deconstruct-ing a pangenome

Enumerate alleles for each snarl on a reference path. Before: all paths.

Deconstruct-ing a pangenome

Enumerate alleles for each snarl on a reference path. Now: only haplotypes.

Deconstruct-ing a pangenome

Enumerate alleles for each snarl on a reference path inc. nested snarls.

Snarl examples (vg deconstruct)

##INFO=<ID=LV,Number=1,Type=Integer,Description="Level in the snarl tree (0=top level)">
##INFO=<ID=PS,Number=1,Type=String,Description="ID of variant corresponding to parent snarl">
##INFO=<ID=AT,Number=R,Type=String,Description="Allele Traversal as path in graph">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  sample1
ref 6   >1>4    GGCAC   CTTAG   60  .   AT=>1>2>4,>1>3>4;LV=0   GT  0|1
ref 15  >4>9    CCCAGG  CCGGTAACTACCGTCACCAGG,CCGGTACGTCA   60  .   AT=>4>8>9,>4>5>6>7>8>9,>4>5>7>9;LV=0    GT  1|2

Snarl examples (vg deconstruct)

##INFO=<ID=LV,Number=1,Type=Integer,Description="Level in the snarl tree (0=top level)">
##INFO=<ID=PS,Number=1,Type=String,Description="ID of variant corresponding to parent snarl">
##INFO=<ID=AT,Number=R,Type=String,Description="Allele Traversal as path in graph">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  sample1
ref 10  >1>7    AAA AAAAAA,AAAA 60  .   AT=>1>5>6>7,>1>2>3>4>5>6>7,>1>4>5>6>7;NS=1;LV=0 GT  1|2

Snarl examples (vg deconstruct)

##INFO=<ID=LV,Number=1,Type=Integer,Description="Level in the snarl tree (0=top level)">
##INFO=<ID=PS,Number=1,Type=String,Description="ID of variant corresponding to parent snarl">
##INFO=<ID=AT,Number=R,Type=String,Description="Allele Traversal as path in graph">
##contig=<ID=ref#0#0,length=56>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  sample1 sample2
ref 11  >2>5    CTTAG   AAGTC   60  .   AT=>2>3>5,>2>4>5;LV=0   GT  0|0 1|.

Snarl examples (vg deconstruct)

Trick for getting this snarl decomposition to look better (currently only for the distance index):

vg index -j [graph.dist] -w 6

Mapping reads with vg giraffe

Short reads

Long reads

Read mapping algorithm overview

Read mapping algorithm overview

  1. Seeding

Read mapping algorithm overview

  1. Seeding
  2. Clustering/chaining

Read mapping algorithm overview

  1. Seeding
  2. Clustering/chaining

Read mapping algorithm overview

  1. Seeding
  2. Clustering/chaining
  3. Alignment

Read mapping algorithm overview

  1. Seeding
  2. Clustering/chaining
  3. Alignment

Read mapping algorithm overview

  1. Seeding
  2. Clustering/chaining
  3. Alignment

Short read giraffe algorithm

  1. Seeding
  2. Clustering/chaining
  3. Alignment

Short read giraffe algorithm

  1. Seeding with Minimizer Index
  2. Clustering/chaining
  3. Alignment

Short read giraffe algorithm

  1. Seeding
  2. Clustering with Distance Index
  3. Alignment

Short read giraffe algorithm

  1. Seeding
  2. Clustering with Distance Index
  3. Alignment

Short read giraffe algorithm

  1. Seeding
  2. Clustering
  3. Alignment

Short read giraffe algorithm

  1. Seeding
  2. Clustering
  3. Gapless extension

Short read giraffe algorithm

  1. Seeding
  2. Clustering
  3. Gapless extension with GBWT

Short read giraffe algorithm

  1. Seeding
  2. Clustering
  3. Gapless extension with GBWT

Short read giraffe algorithm

  1. Seeding
  2. Clustering
  3. Gapless extension
  4. Gapped alignment with graph

Short read giraffe algorithm

  1. Seeding
  2. Clustering
  3. Gapless extension
  4. Gapped alignment with graph

Short read giraffe algorithm

  1. Seeding
  2. Clustering
  3. Gapless extension
  4. Gapped alignment

Long read giraffe algorithm

  1. Seeding
  2. Clustering/chaining
  3. Alignment

Long read giraffe algorithm

  1. Seeding with Minimizer Index
  2. Clustering/chaining
  3. Alignment

Long read giraffe algorithm

  1. Seeding
  2. Chaining
  3. Alignment

Long read giraffe algorithm

  1. Seeding
  2. Chaining
  3. Alignment

Long read giraffe algorithm

  1. Seeding
  2. Chaining
  3. Alignment

Long read giraffe algorithm

  1. Seeding
  2. Chaining
  3. Alignment

Long read giraffe algorithm

  1. Seeding
  2. Chaining
  3. Alignment

Long read giraffe algorithm

  1. Seeding
  2. Chaining
  3. Alignment

Long read giraffe algorithm

  1. Seeding
  2. Chaining
  3. Alignment

Long read giraffe algorithm

  1. Seeding
  2. Chaining
  3. Alignment

Long read giraffe algorithm

  1. Seeding
  2. Zip code tree making with Distance Index
  3. Chaining with Zip code trees
  4. Alignment

Long read giraffe algorithm

  1. Seeding
  2. Zip code tree making with Distance Index
  3. Chaining with Zip code trees
  4. Alignment

Long read giraffe algorithm

  1. Seeding
  2. Zip code tree making with Distance Index
  3. Chaining with Zip code trees
  4. Alignment

Long read giraffe algorithm

  1. Seeding
  2. Zip code tree making with Distance Index
  3. Chaining with Zip code trees
  4. Alignment

Long read giraffe algorithm

  1. Seeding
  2. Zip code tree making with Distance Index
  3. Chaining with Zip code trees
  4. Alignment

Long read giraffe algorithm

  1. Seeding
  2. Zip code tree making with Distance Index
  3. Chaining with Zip code trees
  4. Alignment

Long read giraffe algorithm

  1. Seeding
  2. Zip code tree making with Distance Index
  3. Chaining with Zip code trees
  4. Alignment

Long read giraffe algorithm

  1. Seeding
  2. Zip code tree making with Distance Index
  3. Chaining with Zip code trees
  4. Alignment

Long read giraffe algorithm

  1. Seeding
  2. Zip code tree making with Distance Index
  3. Chaining with Zip code trees
  4. Alignment

Long read giraffe algorithm

  1. Seeding
  2. Zip code tree making with Distance Index
  3. Chaining with Zip code trees
  4. Alignment

Long read giraffe algorithm

  1. Seeding
  2. Zip code tree making
  3. Chaining
  4. Alignment with graph

Long read giraffe algorithm

  1. Seeding
  2. Zip code tree making
  3. Chaining
  4. Alignment with graph

Long read giraffe algorithm

  1. Seeding
  2. Zip code tree making
  3. Chaining
  4. Alignment with graph

Long read giraffe algorithm

  1. Seeding
  2. Zip code tree making
  3. Chaining
  4. Alignment with graph

Long read giraffe algorithm

  1. Seeding
  2. Zip code tree making
  3. Chaining
  4. Alignment with graph

Long read giraffe algorithm

  1. Seeding
  2. Zip code tree making
  3. Chaining
  4. Alignment

Giraffe results

On the HPRC v2 graph which is x size?

vg graph formats and indexes

Indexes

  1. .gbwt (Graph Burrows Wheeler Transform): haplotype paths
  2. .gg (GBWT Graph): node sequences for a GBWT
  3. .dist (Distance Index): snarl decomposition plus minimum distances
  4. .zipcodes: per-node distance information used by vg giraffe
  5. .min (Minimizer Index): minimizers used by vg giraffe
  6. .gcsa (Generalized Compressed Suffix Array): substring index used by vg map and vg mpmap

Graphs

  1. .gbz (GBWT + GG): the graph induced by the GBWT
  2. .hg (/.vg) (HashGraph): graph format optimized for speed
  3. .pg (/.vg) (PackedGraph): graph format optimized for space efficiency
  4. .xg: older graph format
  5. .vg: protobuf-based graph format

Conclusion

Useful resources